1. INTRODUCTION

Today, data volumes are growing rapidly as structured and unstructured data are stored, transferred
and shared, inundating businesses. E-commerce sites and applications produce huge and complex
data, which is termed Big Data. It is a mature term that evokes large amounts of unstructured,
semi-structured and structured data. Cloud computing furnishes a stable platform for the vital,
economical and efficient organisation and operation of such data.

Handling and storing such huge data requires a large database. A traditional database
management system (DBMS) cannot cope with data management at this scale. Relational databases were
not designed for the scale and speed challenges that modern applications face, nor were they built to
take advantage of the commodity storage and computing power available today.

NoSQL stands for "Not only SQL", as it partially supports SQL. These data stores are increasingly
used in Big Data and in many web applications. NoSQL is mainly useful for storing unstructured data.
Unstructured data is growing more rapidly than structured data and does not fit the relational schemas
of an RDBMS. Hence NoSQL [1] data stores were introduced, offering high availability, high scalability
and consistency. NoSQL databases are widely used to process heavy data loads and in web applications.

Nowadays most companies are shifting to NoSQL databases [1-3] for their flexibility and their
ability to scale out and to handle bulky unstructured data, in contrast with relational databases. Several
kinds of NoSQL cloud data stores have been developed, including document stores, key-value stores,
column-family stores and graph databases. Their advantage is that they cope with vast data loads with
the aid of scale-out applications.

Existing partitioning techniques are classified into static [4-5] and dynamic [6] partitioning systems.
In static partitioning, related data items are placed on a single partition for access, and once the
partitions are formed they do not change. The advantage of static partitioning is that no data is
migrated, so the cost of data migration is negligible.


Journal homepage: http:/iaescore.com/online/index.php/IJAAS 


22 o ISSN: 2252-8814 


In a dynamic partitioning system, the partitions are formed dynamically and change frequently
so as to reduce distributed transactions. As the partitions change, the chance of migrating data is
high, and so is the cost of migration.

Taking into consideration the pros and cons of static and dynamic partitioning systems, a
scalable workload-driven data partitioning technique is derived. The main aim of this technique is to
reduce distributed transactions, make the database scalable and improve the performance of the
application. The scalable algorithm tracks the data access pattern, that is, which warehouse supplies
which requesting warehouse, and also analyses the transaction logs. In the proposed system, the
partitions are formed in the NoSQL database MongoDB using this scalable workload-driven technique,
which falls under neither the static nor the dynamic category. The transaction logs and the data access
pattern are monitored and the partitions are formed periodically.

The essential contributions of this paper are as follows:

a. The design of scalable workload-driven partitioning [2], which is based on the data access pattern and
traces the logs, is studied and implemented in MongoDB by forming 5 partitions.

b. The 9 TPC-C tables are mapped into 9 different collections in MongoDB and transactions are carried out
on 5 statically placed partitions. This static approach increases the number of distributed transactions
and decreases the performance of the application.

c. The 9 TPC-C tables are then mapped into a single collection, the scalable workload-driven technique is
used to partition the data across the 5 partitions, and transactions are carried out over those partitions.
This reduces the number of distributed transactions. The response time is lower and the throughput of
the system is higher than in the above case.

d. The results of both cases are taken on a local machine and also on an EC2 instance to check the
performance over the cloud.

The rest of this paper is organised as follows. Section 2 gives the background and briefly explains
the related work done by other researchers. Section 3 describes the central idea of the work, including
the design of the scalable workload-driven algorithm, the architecture of the proposed system and the
mapping of the TPC-C tables into MongoDB collections. Section 4 covers the implementation of the work.
Section 5 states the results of the work done. Finally, Section 6 concludes the paper.


2. BACKGROUND 

Data partitioning means physically partitioning the database, which helps the database scale out
and remain available at all times. Much work has been done on data partitioning schemes that give the
application high performance and scalability and restrict transactions to a single partition. Some of
them are described below.

The prototype is built with the TPC-C benchmark tool, which uses OLTP transactions for web
applications. Such OLTP transactions require a quick response from the application. TPC-C is a popular
online transaction processing (OLTP) benchmark workload for estimating performance on different
hardware and software configurations.

Sudipto Das introduced ElasTraS [4], which uses schema-level partitioning to gain scalability.
The intent behind schema-level partitioning is to collect related data into the same partition, as
transactions access only the data they need from a large database. A major goal of ElasTraS is to
provide elasticity and also to reduce the operating cost of the system during failure.

Carlo Curino put forward Schism: A Workload-Driven Approach to Database Replication and
Partitioning [7] to improve the scalability of shared-nothing distributed databases. It aims to
minimise distributed transactions while producing balanced partitions. For transactional workloads, a
graph partitioning technique is used to balance the data. In graph partitioning, data items that are
accessed together by transactions are kept on a single partition.

J. Baker et al. presented Megastore [5], in which data is partitioned into a collection of entity
groups. An entity group is a set of related data items placed on a single node, so that the data items
required are accessed from a single node. Megastore provides synchronous replication, but this delays
the transaction.

Xiaoyan Wang presented Automatic Data Distribution in Large-scale OLTP Applications [8]. The
data is divided into two categories: original data and incremental data. For original (old) data, the
Bond Energy Algorithm (BEA) is applied; for incremental (progressive) data, online partitioning is
invoked, in which partitions are formed on the basis of the k-Nearest Neighbour (KNN) clustering
algorithm. Data placement then allocates these data to the partitions using a genetic algorithm.

Francisco Cruz put forward a table splitting technique [1] that considers the system workload.
A relevant splitting point is a point that splits a region into two new regions with similar loads.
The split key search algorithm finds such a point. The algorithm estimates the splitting point when it
receives the key from the first request of each region. For each subsequent request, if the split key
differs, the algorithm changes the splitting point.

Curino proposed Relational Cloud [9], in which scalability is achieved with a workload-aware
approach termed graph partitioning. In graph partitioning, data items that are frequently accessed
together by transactions are kept on a single partition. The graph-based partitioning method is used to
spread large databases across many machines for scalability. The notion of adjustable privacy shows how
layering different levels of encryption can enable SQL queries to be processed over encrypted data.

Miguel Liroz-Gistau [6] proposed the dynamic partitioning algorithms DynPart and DynPartGroup
in Dynamic Workload-Based Partitioning Algorithms for Continuously Growing Databases, for efficient
partitioning of incremental data. The problem with static partitioning is that each time a new set of
data arrives, the partitioning is redone from scratch.

Brian Sauer and Wei Hao presented [10] a different way of partitioning data using data mining
techniques. It is a methodology for NoSQL database partitioning based on clustering of database log
files. A new algorithm was built to overcome a limitation of k-means, namely the detection of oddly
shaped clusters, by using a minimum spanning tree, which is more effective than k-means.


3. CENTRAL IDEA OF THE WORK
3.1. Design of Scalable Workload Driven Partitioning in MongoDB

The proposed system maps the TPC-C schema into MongoDB collections to improve performance.
In this partitioning strategy, transaction logs and the data access pattern are monitored. The data
access pattern is analysed to determine which warehouse is more likely to supply a requesting
warehouse. For example, when a customer places an order, the order is normally satisfied by a warehouse
on one partition; if the item is out of stock there, the transaction is fulfilled by another warehouse on
another partition. This behaviour of serving the requesting warehouse is tracked and patterns are formed.
Based on these two factors, the partitions are formed.
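
As an illustration of tracking the access pattern, the following minimal sketch counts how often each
supplier warehouse serves each requesting warehouse. The log format, a list of (requester, supplier)
pairs, and the function name access_pattern are assumptions made for this example and are not taken
from the implementation described in this paper.

```python
from collections import Counter


def access_pattern(transactions):
    """Count how often each (requester, supplier) warehouse pair occurs."""
    pattern = Counter()
    for requester, supplier in transactions:
        pattern[(requester, supplier)] += 1
    return pattern


# Hypothetical transaction log: one (requester, supplier) pair per order.
log = [(1, 1), (1, 2), (1, 2), (3, 3), (2, 1)]
pattern = access_pattern(log)

# Pairs whose two warehouses differ are orders served by another warehouse.
remote = {pair: count for pair, count in pattern.items() if pair[0] != pair[1]}
```

Warehouse pairs with high counts in remote are candidates for being placed on the same partition.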


3.2. Scalable Workload Driven Partitioning Algorithm 

The architecture of the scalable workload-driven algorithm [2] gives an overview of the project. The
database to be partitioned contains data items of local and remote warehouses, where the local
warehouse is the requesting warehouse and the remote warehouse is the supplier warehouse.
The algorithm is applied to the database and shards are formed. This restricts transactions to a single
partition, which improves the performance and throughput of the application. The algorithm is neither
static nor dynamic; it lies between the two, and partitions are restructured as needed by referring to
the transaction logs and access pattern.

Figure 1. Architecture of the work flow


3.3. Definitions of Terms In The Algorithm
3.3.1. Load 

The load of a partition [2], as calculated in the algorithm, is based on the number of transactions
executed on each warehouse; the total load of the partition is the sum of the loads on its warehouses.
The mean load is calculated in order to compute the standard deviation, which measures how far the
partition loads deviate from the average load.
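
The load calculation can be sketched as follows. This is a minimal example under stated assumptions:
the per-warehouse transaction counts are given as a dictionary, partitions are numbered from 0, and the
population standard deviation (dividing by the number of partitions) is used. The function names are
hypothetical.

```python
from statistics import pstdev


def partition_loads(assignment, warehouse_load, num_partitions):
    """Sum the per-warehouse transaction counts for each partition.

    assignment maps warehouse -> partition index (0-based).
    warehouse_load maps warehouse -> number of transactions executed on it.
    """
    loads = [0] * num_partitions
    for warehouse, partition in assignment.items():
        loads[partition] += warehouse_load.get(warehouse, 0)
    return loads


def load_deviation(loads):
    """Population standard deviation of the partition loads."""
    return pstdev(loads)


# Hypothetical data: four warehouses spread over two partitions.
assignment = {1: 0, 2: 1, 3: 0, 4: 1}
warehouse_load = {1: 10, 2: 30, 3: 20, 4: 20}
loads = partition_loads(assignment, warehouse_load, 2)
deviation = load_deviation(loads)
```

A deviation of zero means every partition carries the same load; larger values indicate imbalance.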


3.3.2. Association 

The association of a partition [2], as calculated in the algorithm, reflects the number of local
transactions and distributed transactions executed on the partition. A local transaction is one that is
fulfilled by the requesting warehouse alone; a distributed transaction is one that is fulfilled by a
supplier warehouse because the requesting warehouse was out of stock. For example, a customer requests
an item from warehouse w1 on partition A, but as there is no stock, the request is completed by another
warehouse, w2, on partition B.
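
A minimal sketch of counting local and distributed transactions for a given placement is shown here.
It assumes that a transaction is distributed when its requesting and supplier warehouses sit on
different partitions; the function name and data layout are hypothetical.

```python
def association(assignment, transactions):
    """Return (local, distributed) transaction counts for one placement.

    assignment maps warehouse -> partition.
    transactions is a list of (requester, supplier) warehouse pairs.
    """
    local = 0
    distributed = 0
    for requester, supplier in transactions:
        if assignment[requester] == assignment[supplier]:
            local += 1
        else:
            distributed += 1
    return local, distributed


# The example from the text: w1 is on partition A and w2 on partition B.
transactions = [("w1", "w1"), ("w1", "w2")]
split = association({"w1": "A", "w2": "B"}, transactions)
together = association({"w1": "A", "w2": "A"}, transactions)
```

Placing w1 and w2 on the same partition turns the out-of-stock request into a local transaction.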

In the scalable workload-driven algorithm, the input consists of the number of partitions to be
formed, the number of warehouses and the transaction data. The output of the algorithm is the optimised
partitions. The algorithm starts by distributing the warehouses statically into the partitions;
combinations of partitions and warehouses are then formed with the help of a genetic algorithm, which
gives optimised combinations.

Next, the load on each warehouse is calculated and summed to give the load on each partition, and
the standard deviation of the partition loads is computed. The load values are then sorted in ascending
order. The association of the partitions is also calculated and sorted in descending order. The sum of
the load rank and the association rank is computed and sorted in ascending order, and the top 5
combinations are selected as those with optimised load balance and association. Figure 2 below
explains the scalable workload-driven algorithm [2].


1.  Start with static distribution.
2.  Combination(partition, warehouse);
3.  foreach partition do
        foreach warehouse do
            calculate $L_{w_i}$();
            $L_{p_k} = \sum_{i} L_{w_i}$;
        end
        $LD_{mean} = \sum_{k=1}^{s} L_{p_k} / s$;
        $\sigma(LD) = \sqrt{\sum_{k=1}^{s} (L_{p_k} - LD_{mean})^2 / s}$;
    end
4.  Sort partition Load in ascending order($\sigma(LD)$, s);
5.  foreach transaction do
        requester_warehouse;
        supplier_warehouse;
    end
6.  Sort partition Association in descending order();
7.  repeat
        Rank Value = $\sum$ (partition Load, partition Association)
    until end;
8.  Sort rank value ascending order();
9.  Select top 5 combinations
10. Select the top combination as the best combination with effective load balancing and association.

where
s = total number of partitions
n = total number of warehouses
$L_{p_k}$ = no. of transactions executed on partition k
$LD_{mean}$ = average of transactions executed on all the partitions
$\sigma(LD)$ = standard deviation of load






Figure 2. Scalable workload-driven algorithm 
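
The ranking steps of the algorithm can be sketched in a few lines. This is an illustrative sketch, not
the implementation used in the experiments: candidate placements are passed in directly instead of being
generated by a genetic algorithm, the association score is assumed to be the number of transactions whose
requesting and supplier warehouses share a partition, and ties receive consecutive ranks in input order.
All names are hypothetical.

```python
from statistics import pstdev


def rank_combinations(candidates, warehouse_load, transactions,
                      num_partitions, top=5):
    """Return the `top` placements with the lowest load rank + association rank."""
    scored = []
    for assignment in candidates:
        # Load: sum warehouse loads per partition, then take the deviation.
        loads = [0] * num_partitions
        for warehouse, partition in assignment.items():
            loads[partition] += warehouse_load[warehouse]
        deviation = pstdev(loads)
        # Association (assumed): transactions served inside one partition.
        local = sum(1 for req, sup in transactions
                    if assignment[req] == assignment[sup])
        scored.append((deviation, local))

    indices = range(len(candidates))
    # Load is sorted ascending, association descending; rank 0 is best.
    by_load = sorted(indices, key=lambda i: scored[i][0])
    by_assoc = sorted(indices, key=lambda i: -scored[i][1])
    load_rank = {i: rank for rank, i in enumerate(by_load)}
    assoc_rank = {i: rank for rank, i in enumerate(by_assoc)}

    order = sorted(indices, key=lambda i: load_rank[i] + assoc_rank[i])
    return [candidates[i] for i in order[:top]]


# Hypothetical data: four warehouses, two partitions, three candidates.
warehouse_load = {1: 10, 2: 10, 3: 10, 4: 10}
transactions = [(1, 2), (1, 2), (3, 4), (1, 1)]
candidates = [
    {1: 0, 2: 0, 3: 1, 4: 1},  # balanced, all transactions local
    {1: 0, 2: 1, 3: 0, 4: 1},  # balanced, mostly distributed
    {1: 0, 2: 0, 3: 0, 4: 1},  # unbalanced
]
best = rank_combinations(candidates, warehouse_load, transactions, 2)
```

In this example the first candidate comes out on top because it is both balanced and keeps every
transaction on a single partition.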


3.4. Mapping of the TPC-C Schema into MongoDB

Here, the TPC-C schema is mapped into the data model of MongoDB. The TPC-C schema has nine
tables in total: district, customer, stock, warehouse, orders, new-order, order-line, item and history.
These nine tables are mapped to a single collection in MongoDB. Figure 3 shows the mapping of the
TPC-C schema to MongoDB. The history table has not been considered while creating the MongoDB
collection. When a transaction is triggered, searching for the particular data in a single collection,
that is, a single partition, performs better than searching across 9 different collections. The
partitions are formed using the scalable workload-driven algorithm and the mapping into a single
collection in MongoDB. The reason for creating a single collection for all 9 tables is to minimise the
response time for retrieving results.
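
One possible shape for such a single collection is sketched here with plain Python dictionaries standing
in for MongoDB documents. The field names (a "table" discriminator and a shared "w_id" warehouse key)
and the helper functions are assumptions made for illustration; the actual document layout is the one
given in Figure 3.

```python
def to_document(table, row):
    """Wrap a TPC-C row as a document for one shared collection."""
    doc = dict(row)
    doc["table"] = table  # hypothetical field recording the source table
    return doc


def find(collection, **criteria):
    """Return documents whose fields equal all the given criteria."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in criteria.items())]


# An in-memory list stands in for the single MongoDB collection.
collection = [
    to_document("warehouse", {"w_id": 1, "w_name": "W1"}),
    to_document("district", {"w_id": 1, "d_id": 3}),
    to_document("stock", {"w_id": 2, "i_id": 7, "s_quantity": 0}),
]

# Everything belonging to warehouse 1 is found with one lookup.
warehouse_1 = find(collection, w_id=1)
out_of_stock = find(collection, table="stock", s_quantity=0)
```

Keeping a common warehouse key on every document is what would let all data for one warehouse be placed
on, and read from, a single partition.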





Figure 3. Mapping of TPC-C schema 


4. IMPLEMENTATION 

The implementation considers two metrics: response time and throughput. It was performed on a
local machine and on the cloud, and the difference in response time and throughput was measured. The
tables below specify the configuration of the machines used for the experiment.


Table 1. Configuration of local machine
Edition     : Windows 7
RAM         : 8.00 GB
Storage     : 1 GB
Processor   : Intel Core i3
System Type : 64 bit

Table 2. Configuration of cloud instance
Edition     : Windows 10
RAM         : 16.00 GB
Storage     : 30 GB
System Type : 64 bit
Cost        : $0.263/hr


5. RESULTS 


Response time and throughput are measured on the local machine and on the cloud. The graphs
below cover 15, 25 and 35 warehouses. The number of users varies from 200 to 1000. The purpose of this
experiment is to validate the scalability of the partitioning scheme as the number of concurrent users
increases. Figures 4-15 show the response time and throughput with 15, 25 and 35 warehouses on the
local machine and on the cloud.
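
For reference, the two metrics can be derived from per-transaction latencies as in the following sketch.
It assumes that response time is reported as the mean latency and throughput as completed transactions
per second of wall-clock time; the paper does not state the exact definitions used, and the function
name is hypothetical.

```python
def summarize(latencies_s, elapsed_s):
    """Return (mean response time in seconds, throughput in transactions/second)."""
    if not latencies_s or elapsed_s <= 0:
        raise ValueError("need at least one latency and a positive elapsed time")
    mean_response = sum(latencies_s) / len(latencies_s)
    throughput = len(latencies_s) / elapsed_s
    return mean_response, throughput


# Hypothetical run: three transactions completed within two seconds.
mean_response, throughput = summarize([0.2, 0.4, 0.6], elapsed_s=2.0)
```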

Figure 4. Response time with 15 warehouses on local
Figure 5. Response time with 15 warehouses on cloud
Figure 6. Throughput with 15 warehouses on local
Figure 7. Throughput with 15 warehouses on cloud
Figure 8. Response time with 25 warehouses on local
Figure 9. Response time with 25 warehouses on cloud
Figure 10. Throughput with 25 warehouses on local
Figure 11. Throughput with 25 warehouses on cloud
Figure 12. Response time with 35 warehouses on local
Figure 13. Response time with 35 warehouses on cloud
Figure 14. Throughput with 35 warehouses on local
Figure 15. Throughput with 35 warehouses on cloud


6. CONCLUSION

Scalable workload-driven partitioning is implemented using MongoDB to meet the demands of
modern cloud applications. The solution using the workload-driven algorithm is validated
experimentally on a local machine and also on the cloud. Using an EC2 cloud instance improves the
metrics used for validation. By implementing the scheme with the TPC-C benchmark, it has been observed
that scalable workload-driven partitioning reduces the number of distributed transactions and gives
better response time than the static TPC-C partitioning.